Abstract

Molecular descriptors have been explored extensively. From these studies, it is known that a large number of descriptors are 
strongly correlated and capture similar characteristics of molecules. In this paper, we evaluate 919 Dragon-descriptors of 6 
different categories by means of clustering. We also analyze these categories of descriptors and find a subset of
descriptors which are least correlated with each other and, hence, characterize molecular graphs distinctively.

Introduction

Molecular descriptors map molecular structures to the reals by 
taking physical, chemical or structural information into account 
[1]. A large number of descriptors have been developed to 
describe different properties of molecular graphs. Therefore, these 
descriptors can be classified into different categories depending
on what kind of information is used (e.g., physical, chemical or
structural information) to define such a measure. The commercial
software package Dragon [2] (version 6.0.26) contains 4885 
molecular descriptors which are classified into 29 categories. 

The problem of analyzing molecular descriptors by applying
clustering techniques has already been explored [3-6]. These studies are
usually based on using principal component analysis (PCA) and 
correlation-based methods for the identification of different 
descriptors. For example, Todeschini et al. [6] and Basak et al. 
[3] evaluated descriptors on a rather small collection of molecular 
graphs using PCA and ranked them based on the intercorrelation. 
In order to find similarities between molecular descriptors, Basak 
et al. [4,5] used a PCA-based clustering technique on both a 
hydrocarbon dataset and mixed chemical compounds. Taraviras 
et al. [7] performed a cluster analysis with 240 descriptors by using 
different clustering algorithms. The weak point of the approaches
sketched above is that the corresponding studies have not been
performed on a large scale (large data sets) and with distinct
descriptors belonging to several categories. Also, the optimal
number of different descriptors (dimension) has not been validated
statistically. In this paper, we overcome these problems.

A thorough evaluation of the vast amount of developed 
descriptors [1] is required to identify categories of descriptors 
which capture structural information differently. In our analysis 
we evaluate 6 categories (see next section) of structural descriptors 
by means of clustering. The main contribution of this paper is to
explore the dimension of the descriptor space, i.e., how many
different descriptors exist among all which have been introduced
so far. Here, we put the emphasis on 919 structural descriptors
from Dragon. In particular, we find that only very few
descriptors are different. In this context, that means they are
least correlated and, therefore, capture structural information
differently.

Methods and Results 

Molecular Descriptors 

To perform our study, we used six categories of descriptors 
implemented in Dragon (version 6.0.26) which are defined as 
follows: 

1. Connectivity indices [1]: These indices are calculated from
the vertex degrees of a molecular graph. The Randić index [8]
is a prominent example thereof.

2. Edge adjacency indices [1]: These indices are based on the
edge adjacency matrix of a graph. The resulting descriptor
value is the sum of all edge entries of the adjacency matrix of a
graph. Balaban et al. [9] developed several indices by using
graph-theoretical matrices.

3. Topological indices [1]: These are structural graph measures
which take various structural features into account, e.g.,
distances and eigenvalues. The term topological index was
first coined by Hosoya [10]. The first and the second Zagreb
indices [11] are prominent examples thereof.

4. Walk path counts [1]: These indices are defined by counting
paths or walks of a graph. Here, the term walk refers to random
walks, which are based on a probability measure. We point
out that such indices have been listed by Todeschini and
Consonni [1].

5. Information indices [1]: These measures are based on
Shannon's entropy. To assign a probability value to a graph,
Dragon uses so-called partition-based methods [12], for which
several graph invariants such as vertices, edges, vertex degrees
and distances have been used [12]. The so-called topological
information content [13] and the Bonchev-Trinajstić index
[14] are prominent examples of partition-based information
indices. So-called partition-independent information-theoretic
measures for graphs have been developed by Dehmer [12].

6. 2D Matrix-based [1]: These descriptors are calculated based
on the elements of so-called graph-theoretical matrices [15] by 
using several algebraic operations. The Balaban-like indices 
inferred from the adjacency matrix [2,9] are important 
examples of this category. 

We want to emphasize that the term 'Topological indices' is 
here misleading and ambiguous. For example, typical information 
indices are based on structural features of a graph by using 
Shannon's entropy. So, they represent topological indices too. The 
same holds for all other groups which have been defined by using 
structural features of molecular structures and, therefore, they are 
topological indices as well, see [1,9,16-19]. 

Data 

In order to evaluate the above mentioned 6 categories of 
descriptors, we use 3 data sets namely: 

1. MS2265 contains (non-isomorphic) molecular structures (only
skeletons, i.e., without vertex- and edge labels) inferred from
the NIST spectral database [20].

2. C15 contains exhaustively generated (non-isomorphic) tree
structures with 15 vertices each [20].

3. N8 contains exhaustively generated (non-isomorphic) graphs
with 8 vertices each [21].

To perform our analysis, we calculated the descriptor values for
these three data sets. We removed those descriptors which give
constant or erroneous values on the three data sets. The
erroneous values are produced by those descriptors for which we
were not able to calculate a descriptor value of a network
without additional physical or chemical information. Finally,
the above mentioned six categories contain 24, 301, 57, 28, 40 and
469 descriptors, respectively (919 in total).
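
The filtering step described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not stated in the text: descriptor values are held in a dictionary keyed by hypothetical descriptor names, a value that could not be computed is marked as None or NaN, and a descriptor is dropped as soon as it is erroneous or constant on the given data set.

```python
import math


def usable_descriptors(values_by_descriptor):
    """Return names of descriptors that are finite everywhere and not constant.

    values_by_descriptor maps a (hypothetical) descriptor name to its list of
    values, one per graph. None or NaN marks a value that could not be computed.
    """
    kept = []
    for name, values in values_by_descriptor.items():
        if any(v is None or not math.isfinite(v) for v in values):
            continue  # erroneous: at least one value could not be computed
        if max(values) == min(values):
            continue  # constant: carries no information on this data set
        kept.append(name)
    return kept


# Toy input with made-up names and values, for illustration only.
toy = {
    "d_const": [1.0, 1.0, 1.0],
    "d_error": [0.5, float("nan"), 0.7],
    "d_ok": [0.1, 0.4, 0.9],
}
print(usable_descriptors(toy))
```

In a setting with several data sets, the same filter would presumably be applied per data set and the surviving names intersected; that choice is an assumption as well.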

Clustering Techniques 

Clustering is an unsupervised learning technique which aims to
find different groups or clusters of objects in data [22]. The groups
are described as collections of objects which are closer to each
other than to the rest of the objects [22]. An example thereof is
hierarchical clustering, where groups of objects are arranged in a
hierarchical order by a so-called dendrogram. The objects which
are clustered in one group have a higher degree of similarity than
the objects which are clustered in different groups. Thus, a
resulting clustering solution allows one to determine clusters where
each cluster shows a distinct property of the data. The similarity or
dissimilarity between two objects is usually determined by using a
similarity/distance function which measures the similarity/distance
between data points of different objects. Examples are
the Euclidean distance, the Manhattan distance or the correlation-based
distance. A dissimilarity can be described as follows:

Several algorithms have been developed for cluster analysis
[22]. These algorithms can be divided into several categories,
namely partition-based clustering, hierarchical clustering, density-based
clustering, grid-based clustering and fuzzy clustering
[22,23]. For instance, k-means, soft k-means and k-medoids
clustering [22] are examples of non-hierarchical
clustering methods. Hierarchical clustering itself can be divided
into two categories called agglomerative and divisive clustering
[22]. Several concrete methods thereof have been
developed, such as single linkage, complete linkage and average
linkage, see [22].

In order to evaluate the descriptors, we perform hierarchical
clustering (average linkage) by using the mentioned Dragon
descriptors and the Spearman rank correlation as a distance
measure. Here, we denote the correlation matrix between
descriptors as Z. Then, the distance between a pair of descriptors
is defined by

$$d_{ij} = 1 - \ldots \qquad (1)$$
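
As an illustration of the distance computation, the following Python sketch derives a distance matrix from Spearman rank correlations. It assumes that the distance is one minus the absolute correlation, which is a common choice but may differ from the exact form of Eq. (1). The function names are hypothetical, and tied values receive average ranks.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_distance_matrix(columns):
    """columns: one list of values per descriptor (all non-constant)."""
    ranked = [average_ranks(col) for col in columns]
    m = len(ranked)
    dist = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            rho = pearson(ranked[i], ranked[j])  # Spearman = Pearson on ranks
            dist[i][j] = dist[j][i] = 1.0 - abs(rho)  # assumed form: 1 - |rho|
    return dist
```

Under this assumption, two descriptors that order the graphs identically (or exactly inversely) have distance 0, so they would be merged first by the hierarchical clustering.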



In order to choose a clustering method, we use the cophenetic
correlation measure [24]. A high correlation coefficient shows that
the distances between the data points are well preserved by the
dendrogram of the hierarchical clustering solution. We calculate
the cophenetic correlation for seven hierarchical clustering
algorithms, namely the Ward, single, complete, average, McQuitty,
median and centroid methods. In our analysis, the cophenetic
correlation coefficient is highest for the average clustering
solution for all three data sets compared to the other clustering
algorithms. The cophenetic correlation coefficients of the average
clustering solutions for the three data sets are 0.84, 0.89 and 0.93.
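
The cophenetic correlation used for choosing the linkage method can be sketched as follows. The code is a naive pure-Python average-linkage (UPGMA) implementation written for illustration only. It is not the software used in the study, it covers only the average method out of the seven that were compared, and all function names are hypothetical.

```python
from itertools import combinations


def average_linkage_cophenetic(dist):
    """Cophenetic distances obtained from naive average-linkage clustering.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    The cophenetic distance of i and j is the height of the merge that
    first puts them into the same cluster.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    coph = [[0.0] * n for _ in range(n)]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            # Average linkage: mean of all between-cluster distances.
            height = sum(dist[i][j] for i, j in pairs) / len(pairs)
            if best is None or height < best[0]:
                best = (height, a, b)
        height, a, b = best
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        clusters[a] = clusters[a] + clusters.pop(b)
    return coph


def cophenetic_correlation(dist):
    """Pearson correlation between original and cophenetic distances."""
    coph = average_linkage_cophenetic(dist)
    idx = list(combinations(range(len(dist)), 2))
    x = [dist[i][j] for i, j in idx]
    y = [coph[i][j] for i, j in idx]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5
```

A value close to 1 means that the dendrogram heights reproduce the input distances well. Repeating the computation for other linkage rules and keeping the rule with the largest coefficient would mirror the selection described above.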

Cluster Validity 

Cluster validity [23,25] is used to evaluate the quality of a
clustering solution (obtained by using a certain clustering algorithm), e.g.,
the optimum number of clusters in the data, or whether the
resulting cluster solution fits the data. Known clustering validation
techniques are divided into three categories, namely internal,
external and relative validity criteria. External validation criteria
evaluate clustering solutions against a predefined clustering structure.
Internal validation criteria are used to find the optimal
number of clusters based on the intrinsic knowledge of the
data. Relative validation criteria are used to compare two different
clustering solutions [23].

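In order to perform our analyses, we use external and internal
clustering validation criteria. For the external validation, we
compared the clustering solution with a predefined group of
clusters which serve as reference clusters. The external clustering
validity of a clustering solution with respect to the given reference
cluster is estimated by using the information-theoretic quantity
$NMI_{max}$ (normalized mutual information) [26,27] defined by

$$NMI_{max} = \frac{I(U,V)}{\max\{H(U),H(V)\}}, \qquad (2)$$

where

$$H(U) = -\sum_{i=1}^{R} \frac{a_i}{N}\log\frac{a_i}{N}, \qquad (3)$$

$$H(V) = -\sum_{j=1}^{C} \frac{b_j}{N}\log\frac{b_j}{N}, \qquad (4)$$

$$I(U,V) = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{n_{ij}}{N}\log\frac{n_{ij}/N}{a_i b_j/N^2}. \qquad (5)$$

Here, $n_{ij}$ is the number of objects shared by cluster i of U and
cluster j of V, $a_i$ and $b_j$ are the corresponding row and column
sums of the contingency table, and N is the total number of objects.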

Table 1. A contingency table which defines the overlap
between two cluster solutions, U and V.

Hereby, we assume that we have two clustering solutions U and
V which have R and C clusters, respectively. The overlap between these two
clustering solutions is shown in the contingency table (Table 1). We calculated
$NMI_{max}$ for all three data sets with different numbers of clusters.
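
The normalized mutual information defined above can be computed directly from two label vectors, as the following Python sketch shows. The function names are hypothetical and natural logarithms are assumed; the base of the logarithm cancels in the ratio, so it does not affect the result.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of the cluster-size distribution of one partition."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information of two partitions given as label sequences."""
    n = len(u)
    a, b = Counter(u), Counter(v)      # row and column sums
    joint = Counter(zip(u, v))         # non-empty contingency-table cells
    return sum((nij / n) * log((nij / n) / ((a[i] / n) * (b[j] / n)))
               for (i, j), nij in joint.items())


def nmi_max(u, v):
    """Mutual information normalized by the larger of the two entropies."""
    denom = max(entropy(u), entropy(v))
    return mutual_information(u, v) / denom if denom > 0 else 0.0
```

Two partitions that agree up to a renaming of the clusters give a value of 1, whereas statistically independent partitions give 0.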

The Optimal Number of Clusters 

The optimal number of clusters (internal cluster validity) is
determined by consensus clustering [27,28], which has been
performed here as follows. Assume we evaluate N descriptors on a
data set containing n molecular graphs. Thus, we get n descriptor
values for each descriptor. First, we resample the data with sample
size p < n, B = 100 times, for the descriptors to generate B clustering
solutions $\mathcal{U}_k = \{U_k^1, U_k^2, \ldots, U_k^B\}$ for k clusters, where k = 2, 3,
..., 200. After that, we calculate the consensus index for each
number of clusters, k, which is defined as follows:

$$CI(\mathcal{U}_k) = \frac{\sum_{i<j} AM(U_k^i, U_k^j)}{B(B-1)/2}. \qquad (6)$$

Figure 1. Hierarchical clustering using the average algorithm, MS2265 (left), C15 (middle), N8 (right). The total number of descriptors equals
919. They belong to 6 different categories which are as follows: connectivity indices (24), edge adjacency indices (301), topological indices (57), walk
path counts (28), information indices (40) and 2D Matrix-based (469).
doi:10.1371/journal.pone.0083956.g001

Figure 2. The normalized mutual information, $NMI_{max}$, between reference clusters, RC, and the number of clusters, K, obtained by
hierarchical clustering for three data sets MS2265 (left), C15 (right) and N8 (bottom). $NMI_{max}$ for each K has been generated by
sampling the data sets $D_B$, where B = 1, ..., 100 (data set MS2265). The total number of descriptors equals 919. They belong to 6 different
categories which are as follows: connectivity indices (24), edge adjacency indices (301), topological indices (57), walk path counts (28), information
indices (40) and 2D Matrix-based (469).
doi:10.1371/journal.pone.0083956.g002



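As to the measure AM, we use the adjusted Rand index ARI
[29], which for two clustering solutions U and V is defined in terms
of the entries $n_{ij}$ of their contingency table (Table 1), its row
sums $a_i$ and its column sums $b_j$ by

$$ARI = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\Big/\binom{N}{2}}{\frac{1}{2}\left[\sum_i\binom{a_i}{2}+\sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\Big/\binom{N}{2}}. \qquad (7)$$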



Figure 3. Consensus indices using the adjusted Rand index for estimating the number of clusters in the data. These plots have been
generated by sampling the data sets $D_B$, where B = 1, ..., 100, for the three data sets, MS2265 (left), C15 (right), N8 (bottom). The dotted red line shows
the optimal number of clusters.
doi:10.1371/journal.pone.0083956.g003

The number of clusters k for which CI attains its maximum is
chosen as the optimal number of clusters, namely

$$k_{optimal} = \arg\max_{k = 2,\ldots,k_{max}} CI(\mathcal{U}_k). \qquad (8)$$
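
The consensus index of Eq. (6), with the adjusted Rand index as agreement measure, and the selection rule of Eq. (8) can be sketched in Python as follows. The sketch assumes that the B clustering solutions for each k are already available as label vectors over the same objects; how they are produced (resampling and hierarchical clustering) is outside its scope. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(u, v):
    """Adjusted Rand index of two partitions given as label sequences."""
    n = len(u)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    sum_a = sum(comb(c, 2) for c in Counter(u).values())
    sum_b = sum(comb(c, 2) for c in Counter(v).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)


def consensus_index(clusterings):
    """Mean pairwise agreement over all B(B-1)/2 pairs of clusterings."""
    pairs = list(combinations(clusterings, 2))
    return sum(adjusted_rand_index(u, v) for u, v in pairs) / len(pairs)


def optimal_k(clusterings_by_k):
    """Return the k whose clusterings have the largest consensus index."""
    return max(clusterings_by_k,
               key=lambda k: consensus_index(clusterings_by_k[k]))
```

The dictionary passed to `optimal_k` maps each candidate number of clusters to its list of B label vectors. A high consensus index indicates that the clustering for that k is stable under resampling.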



Determining a Highly Correlated Subset of Descriptors 

Let D be a set of descriptors and |D| its cardinality. Let S be a
subset of D. The selected |D| = 919 descriptors can be reduced to a
set of descriptors, S ⊆ D. Each of the remaining |D| − |S| descriptors
has a significant correlation with at least one of the descriptors in
the set S, whereas the descriptors in S are not significantly correlated
with each other. If two descriptors show a significant correlation with each
other, then we conclude that they capture structural information
similarly. In order to assess the significance of the correlation
between two descriptors, we proceed as follows:

Let M be a data set of N descriptors and n samples. First, we
generate bootstrap data sets, $M_k$, k = 1, ..., B, with B = 500, possessing sample
size p = 200, where p < n. Then, for each data set $M_k$, we perform
a correlation test [30,31] between each pair of descriptors and
obtain a p-value $p_{ij}$ for each pair. Thus, we test N(N − 1)/2
hypotheses for all pairs. In order to control the false positives in this
multiple hypothesis testing problem, we use the Bonferroni correction
method for multiple testing correction (MTC) [32] and obtain
adjusted p-values. For each pair, these adjusted p-values are denoted
by $q_{ij}$. In order to decide whether the correlation between a pair is
significant, we choose α = 0.00001. After applying the correlation
test and MTC, we obtain a binary matrix $I_{M_k}$, which is defined as
follows:

$$I_{M_k}(i,j) = \begin{cases} 1 & \text{if } q_{ij} \le \alpha \\ 0 & \text{if } q_{ij} > \alpha \end{cases} \qquad (9)$$

Finally, we calculate a summary statistic, T(i,j), for each pair of
descriptors by averaging the values over the B bootstrap data sets, i.e.,

$$T(i,j) = \frac{1}{B}\sum_{k=1}^{B} I_{M_k}(i,j).$$
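
The bootstrap procedure above can be sketched in Python as follows. Several details are assumptions made only for this illustration: the correlation test is replaced by a Fisher z approximation for the Pearson correlation (the test of [30,31] may differ), resampling is done with replacement, and all function and parameter names are hypothetical.

```python
import random
from math import atanh, sqrt
from statistics import NormalDist


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # constant resample: treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def correlation_p_value(x, y):
    """Two-sided p-value from the Fisher z approximation (stand-in test)."""
    # Clamp r so that atanh stays finite for (almost) perfect correlations.
    r = max(-1 + 1e-12, min(1 - 1e-12, pearson(x, y)))
    z = atanh(r) * sqrt(len(x) - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def summary_statistic(columns, n_boot=500, size=200, alpha=1e-5, seed=0):
    """Fraction of resamples in which each pair is significantly correlated.

    columns holds one list of values per descriptor, all of equal length.
    """
    rng = random.Random(seed)
    n, m = len(columns[0]), len(columns)
    n_tests = m * (m - 1) // 2            # number of hypotheses per resample
    hits = [[0] * m for _ in range(m)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(size)]  # with replacement
        sample = [[col[i] for i in idx] for col in columns]
        for i in range(m):
            for j in range(i + 1, m):
                p = correlation_p_value(sample[i], sample[j])
                q = min(1.0, p * n_tests)  # Bonferroni-adjusted p-value
                if q <= alpha:             # indicator of Eq. (9)
                    hits[i][j] += 1
                    hits[j][i] += 1
    return [[h / n_boot for h in row] for row in hits]
```

The returned matrix plays the role of T(i,j): an entry of 1 means that the pair was significantly correlated in every resample, and an entry of 0 means that it never was.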

Table 2. The optimal number of clusters for the three data sets obtained by using consensus indices (CI).

Data set   CI       # of clusters (|P|)   # Descriptors in each cluster
MS2265     0.942    5                     |c1| = 863, |c2| = 22, |c3| = 18, |c4| = 1, |c5| = 15
C15        0.9878   16                    |c1| = 764, |c2| = 32, |c3| = 12, |c4| = 26, |c5| = 2, |c6| = 10,
                                          |c7| = 9, |c8| = 6, |c9| = 6, |c10| = 1, |c11| = 1, |c12| = 1,
                                          |c13| = 2, |c14| = 6, |c15| = 24, |c16| = 17
N8         1.00     7                     |c1| = 834, |c2| = 3, |c3| = 12, |c4| = 26, |c5| = 27, |c6| = 14, |c7| = 3

The optimal number of clusters (for the three data sets) for a clustering solution P is represented by the set P = {c1, c2, ..., c|P|}, where |P| is the optimal number of clusters
in the data.

doi:10.1371/journal.pone.0083956.t002



Table 3. The descriptors in predicted clusters (rows) overlapping with different categories of descriptors.

Figure 4. Levelplot of the correlation between the descriptors of the subset S for the three data sets, MS2265 (left), C15 (right), N8 (bottom).
doi:10.1371/journal.pone.0083956.g004

In order to decide whether the correlation between two
descriptors is strong, we choose a cut-off threshold α_sum = 0.99.
If the summary statistic between two descriptors satisfies the
inequality T(i,j) ≥ α_sum, then we define the two descriptors to be
strongly correlated with each other. The descriptors in the set S
have been chosen as follows. Suppose a descriptor D_i has the
maximum number of summary statistics greater than or equal to α_sum (i.e.,
#(T(i,j) ≥ α_sum), where j = 1, ..., |D|); then the
descriptor D_i is ranked first, and D_i is included in the subset S.
Then we remove the descriptor D_i and the other descriptors with
which D_i has a summary statistic ≥ α_sum. Then, we apply the same
procedure to the remaining descriptors as long as we find a
descriptor having a summary statistic ≥ α_sum with at least one of the
remaining descriptors. Note that some of the descriptors do
not have any summary statistic reaching α_sum with any of the
other descriptors. These descriptors are described as lowly
correlated descriptors and such descriptors are also included in
the subset S.
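
The selection of S described above amounts to a greedy procedure, which the following Python sketch illustrates. It is one possible reading of the description rather than the authors' implementation: ties are broken by the smaller index, the threshold is applied as T(i,j) ≥ α_sum throughout, and the function name is hypothetical.

```python
def select_subset(T, alpha_sum=0.99):
    """Greedily pick representatives from a summary-statistic matrix T.

    T is a symmetric matrix (list of lists); descriptors are identified
    by their index. Returns the indices forming the subset S.
    """
    remaining = set(range(len(T)))
    selected = []
    while remaining:
        # Strongly correlated partners of every remaining descriptor.
        partners = {
            i: {j for j in remaining if j != i and T[i][j] >= alpha_sum}
            for i in remaining
        }
        # Descriptor with the most partners; ties go to the smaller index.
        best = max(sorted(remaining), key=lambda i: len(partners[i]))
        if not partners[best]:
            # Only lowly correlated descriptors are left: keep them all.
            selected.extend(sorted(remaining))
            break
        selected.append(best)
        remaining -= partners[best] | {best}
    return selected
```

Every descriptor that is removed in some round has a summary statistic of at least α_sum with the representative chosen in that round, which matches the property stated for the descriptors outside S.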

This procedure reduces |D| descriptors to |S| descriptors. That
means, starting with a set D of descriptors, we hypothesize that the
set S identifies structural properties of a graph class distinctly. The
remaining |D| − |S| descriptors show a stronger similarity
(correlation) with at least one of the descriptors of the set S.

Interpretation of the Results 

The clustering of descriptors for the three data sets is shown in
Figure 1. In this figure, the six categories of descriptors are shown
in different colors. The figure indicates that the descriptors of each
category have not been clustered according to their
respective groups. For the external validity of the resulting
clustering solution, we estimated $NMI_{max}$ (normalized mutual
information) [26] between the reference cluster, RC = {c1, c2, c3, c4, c5,
c6} (the descriptors of the six categories, |RC| = 6, with {|c1| =
24, |c2| = 301, |c3| = 57, |c4| = 28, |c5| = 40, |c6| = 469}, are considered
as the groups of the reference cluster) and the clusters of the
clustering solution obtained by cutting the dendrogram at different
heights, i.e., for different numbers of clusters.

Table 4. Given the subset S, the remaining |D| − |S| descriptors have at least one pair for which the summary statistic T(i,j)
is greater than α_sum = 0.99 with the |S| descriptors.

Data set   Names of the descriptors
MS2265     SM3_L, H_Dt, AVS_B.v., SM02_EA.dm., Eig11_AEA.bo., SpMAD_AEA.ed., CIC2, Eig13_AEA.bo., AVS_B.s., SM06_AEA.dm.,
           Eig14_AEA.dm., MAXDP, J_Dz.v., BIC4, SpDiam_AEA.dm., SpMAD_X, PJI2, SpPosA_B.m., IDDE
C15        SM2_B.s., PW4, Chi1_EA.ri., SM02_EA.dm., VE1_A, IC2, CENT, SM13_AEA.bo., Eig03_EA.bo., SM03_AEA.dm., VE3_Dz.p.,
           piPC05, Eig04_AEA.bo., SpDiam_AEA.dm., piPC06, Eig02_AEA.dm., IVDE, MAXDP, PJI2, Eig05_AEA.dm., Chi0_EA.dm.,
           Eig07_AEA.ed.
N8         QW_L, TIE, VE3_B.i., BIC1, VE3_Dz.i., Eig10_AEA.dm., SpPosLog_B.m., SM03_AEA.dm., Eig11_AEA.ri., SM04_AEA.dm.,
           CSI, VE1_Dt, Eig08_EA.ed., SpMaxA_AEA.bo., Yindex, Ram, IVDE, Chi1_EA.dm.

doi:10.1371/journal.pone.0083956.t004



The estimated normalized mutual information is calculated by
sampling the data B = 200 times. Results for the three data sets
(average NMI) are shown in Figure 2. The average normalized
mutual information between the reference cluster and the
clusters created by performing average hierarchical clustering
shows that the predicted clusters and the reference cluster are
quite dissimilar. Also, the descriptors
of different categories are strongly correlated with each other.

Next, we predict the optimal number of clusters, P = {c1, c2, ...,
c|P|}, by using the consensus index measure for different numbers of
clusters generated by a clustering solution. The plots of the
consensus indices for the three data sets are shown in Figure 3.
The consensus indices are calculated for k = 2, ..., 200 clusters.
CI for different numbers of clusters for the three data sets does not
show an absolute maximum. Therefore, we selected the first local
maximum, which gives the optimal number of clusters. The optimal
number of clusters is shown with a dotted red line in Figure 3.
The consensus indices (CI) for the optimal number of clusters (|P|)
and the total number of descriptors (|c_i|, where i = 1, ..., |P|) in
each cluster for the three data sets, MS2265, C15 and N8, are shown
in Table 2. The optimal number of clusters is very small for all
three data sets. For all data sets, the first cluster is the largest
one and contains more than 80% of the 919 descriptors. The
cardinalities of the remaining clusters are smaller as they contain
far fewer descriptors. The largest cluster for all three data sets
contains descriptors from all six categories, which means that most
of the descriptors from different categories are strongly
correlated with each other and, therefore, they measure
structural information similarly.
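
Choosing the first local maximum of the consensus index, as described above, can be sketched as follows. The function name and the numbers in the example are purely illustrative and are not taken from the study; how plateaus are handled is an assumption of this sketch.

```python
def first_local_maximum(ci_by_k):
    """Return the smallest k whose CI is a local maximum.

    ci_by_k maps the number of clusters k to its consensus index.
    A value counts as a local maximum if it is at least as large as its
    left neighbour and strictly larger than its right neighbour.
    """
    if not ci_by_k:
        raise ValueError("no consensus indices given")
    ks = sorted(ci_by_k)
    for pos, k in enumerate(ks):
        left = ci_by_k[ks[pos - 1]] if pos > 0 else float("-inf")
        right = ci_by_k[ks[pos + 1]] if pos + 1 < len(ks) else float("-inf")
        if ci_by_k[k] >= left and ci_by_k[k] > right:
            return k


# Made-up consensus indices, for illustration only.
toy_ci = {2: 0.80, 3: 0.90, 4: 0.94, 5: 0.91, 6: 0.97}
print(first_local_maximum(toy_ci))
```

In the toy example the curve keeps rising after k = 5, so a global argmax would pick a later value, whereas the first local maximum stops at the earliest peak.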

As a next step, we examine the so-called overlap between the
optimal number of clusters shown in Table 2 and the six categories
of descriptors. That means we have to determine how many
different descriptors are distributed over different groups (belonging
to the optimal number of clusters). This distribution over
different clusters could give some information, namely which
category might capture structural information of the graphs more
uniquely than others. The results are shown in Table 3 and we
interpret these results as follows. The intersection of the
descriptors between the optimal clusters and the categories of
descriptors shows that the edge adjacency indices are grouped into
different clusters for all three data sets in comparison to the
remaining categories. The 2D Matrix-based descriptors are
grouped into different clusters by using C15 and N8. The
information indices are grouped into two different clusters by
using all three data sets. The measures from the categories walk
path counts and topological indices are grouped into different
clusters by using C15 only. This shows that these descriptors
behave differently on trees. The overlap indicates that the group of
edge adjacency indices contains more descriptors which capture
structural information of the graphs differently compared to other
categories.

Table 5. The number of descriptors of S which belong to six
different categories by using three data sets.

Descriptor category       MS2265   C15   N8
Connectivity indices      0        0     0
Edge adjacency indices    7        11    7
Topological indices       2        4     3
Walk path counts          0        2     0
Information indices       3        2     3
2D Matrix-based           7        3     5

doi:10.1371/journal.pone.0083956.t005

Next, we find a subset of descriptors S ⊆ D, |D| = 919. The
main idea is to find a smaller set of descriptors which are weakly
correlated with each other and, hence, capture
structural information uniquely. If they were strongly
correlated, they would capture similar structural information of
the graphs. Importantly, the remaining descriptors have a much
stronger correlation with them. The procedure to obtain a subset
of descriptors S ⊆ D is described in the section 'Methods and
Results'. We obtained |S| = {19, 22, 18} for the MS2265, C15 and N8
data sets, as shown in Table 4. The levelplots of the correlation for the
subsets of descriptors of the three data sets are shown in Figure 4. For
all three data sets, we can clearly see that the descriptors of these
subsets are not strongly correlated. These subsets of descriptors
might detect structural features of the molecular
graphs uniquely.

Moreover, we now examine for all data sets which descriptors
from S (shown in Table 4) belong to which group out of the six
categories of descriptors. The results are summarized in Table 5.
For each data set, we start with a different number of descriptors
for the different categories. The subset S does not contain any
descriptor from the connectivity indices for all three data sets,
and only two descriptors from the walk path counts are contained
in S by using C15. Two, four and three descriptors from the
category topological indices are contained in S for the three data sets.
Three, two and three descriptors from the category information
indices are in S for the three data sets. Seven, three and five
descriptors from the category 2D Matrix-based are in S for the three
data sets. Seven, eleven and seven descriptors from the category
edge adjacency indices are in S for MS2265, C15 and N8. These are the
maximal numbers of descriptors compared to the other categories of
descriptors. The large occurrence of the descriptors from the
category edge adjacency indices shows again that these descriptors
quantify structural information more uniquely than others.

Table 6. The overlap between S and the predicted clusters (rows).

MS2265
Number of cluster   Descriptors of S
1                   SpMAD_AEA.ed., SpDiam_AEA.dm., Eig13_AEA.bo., Eig14_AEA.dm., MAXDP, IDDE, SM3_L, SpMAD_X, H_Dt,
                    J_Dz.v., SpPosA_B.m., AVS_B.v., AVS_B.s.
2                   SM02_EA.dm.
3                   SM06_AEA.dm., Eig11_AEA.bo.
4                   PJI2
5                   CIC2, BIC4

C15
Number of cluster   Descriptors of S
1                   SpDiam_AEA.dm., Eig03_EA.bo., Eig07_AEA.ed., Eig02_AEA.dm., PW4, IC2, SM2_B.s.
2                   Chi1_EA.ri.
3                   CENT, piPC05
4                   SM02_EA.dm.
5                   Chi0_EA.dm.
6                   Eig04_AEA.bo.
7                   SM13_AEA.bo.
8                   SM03_AEA.dm.
9                   -
10                  Eig05_AEA.dm.
11                  PJI2
12                  MAXDP
13                  piPC06
14                  IVDE
15                  VE1_A
16                  VE3_Dz.p.

N8
Number of cluster   Descriptors of S
1                   Eig08_EA.ed., Eig10_AEA.dm., Eig11_AEA.ri., CSI, TIE, Yindex, QW_L, SpMaxA_AEA.bo., IVDE, SpPosLog_B.m.
2                   Chi1_EA.dm., Ram
3                   SM03_AEA.dm., SM04_AEA.dm.
4                   BIC1
5                   VE3_B.i.
6                   VE3_Dz.i.
7                   VE1_Dt

doi:10.1371/journal.pone.0083956.t006



Also, we examine the overlap between the descriptors from S
and the descriptors in the found clusters; the intersections between
them are shown in Table 6. Interestingly, at least one descriptor
of S (for all data sets) overlaps with the descriptors of each cluster,
except for the ninth cluster by using C15. The overlap with the
found clusters shows that the measures contained in S (for the three
data sets) have the potential to quantify unique structural features
of molecular graphs.

Summary and Conclusions 

In this paper, we have evaluated 919 Dragon descriptors to 
investigate to what extent these measures quantify structural 
information of molecular graphs uniquely. From our analysis, it is 
clear that measures which are strongly correlated are not useful as 
they capture structural information similarly. From this, the 
question of determining the usefulness or quality of topological 
indices arises. 



We found by calculating the information-theoretic quantity
NMI that the six categories of descriptors used are strongly
correlated with other categories of descriptors. This indicates that
despite being categorized into different groups, these descriptors
provide similar information. From this, one can conclude
that many of them have been introduced in an unconsidered
manner. Again, the question of how useful such indices are seems to
be quite important and deserves further attention.

By using all three data sets, the most suitable descriptor subset S
contains those measures which have the largest number of
significant correlations with the remaining descriptors but
are not significantly correlated with each other. S forms a reduced
set of descriptors (the original set contains 919 descriptors) and
its sizes are feasible approximations of the effective dimension of
the descriptor space by using all three data sets. For each individual
data set, we found the size of S to be 19 (MS2265 data set), 18 (N8
data set) and 22 (C15 data set). Because most of the descriptors we
have used are redundant, i.e., they are highly correlated, the
estimation of the effective dimension is an intriguing problem. In
our context, the dimension is the number of different descriptors
among all. By performing our analysis, we obtained a lower bound
on the dimension of the descriptor space regarding the different
classes. Note that these descriptors (the ones in S) depend on the
data set used. By inspecting these subsets, we see that the majority
thereof are from the category of the edge adjacency indices. This
implies that the edge adjacency based descriptors can capture
more structural diversity when quantifying structural properties of
molecular graphs. As another result of this paper, we see that it
would not be appropriate to select descriptors more or less
randomly for QSAR problems. Neither random selection nor
using all available descriptors would be appropriate, as demonstrated
in our paper. To tackle this problem, we suggested a